Abstract - We present two complementary analytical approaches for calculating the distribution
of shortest path lengths in Erdős-Rényi networks, based on recursion equations for the shells around
a reference node and for the paths originating from it. The results are in agreement with numerical
simulations for a broad range of network sizes and connectivities. The average and standard deviation
of the distribution are also obtained. In the case that the mean degree scales as $N^{\alpha}$ with the
network size, the distribution becomes extremely narrow in the asymptotic limit, namely almost all 
pairs of nodes are equidistant, at distance $d = \lfloor 1/\alpha \rfloor + 1$ from each other. The distribution of shortest
path lengths between nodes of degree m and the rest of the network is calculated. Its average is 
shown to be a monotonically decreasing function of m, providing an interesting relation between 
a local property and a global property of the network. The methodology presented here can be 
applied to more general classes of networks. 


The increasing interest in network research in recent years is motivated by the realization
that a large variety of systems and processes which involve interacting objects can be
described by network models. In these models, the objects are represented by nodes
and the interactions are expressed by edges. Pairs of connected nodes can affect each other 
directly. However, the interactions between most pairs of nodes are indirect, mediated by 
intermediate nodes and edges. Important properties of these indirect interactions such as 
their strengths, delay times, coordination, correlation and synchronization depend on the 
paths between different nodes. A pair of nodes, i and j, may be connected by a large number 
of paths. The shortest among these paths are of particular importance because they are 
likely to provide the fastest and strongest interaction between these two nodes. Therefore, 
it is of interest to study the distribution of shortest path lengths (DSPL) between nodes 
in different types of networks. Such distributions are expected to depend on the network 
structure and size. 

Random networks of the Erdős-Rényi (ER) type have been studied extensively since the 1950s
using mathematical methods and computer simulations. The increasing availability
of empirical data on networks since the late 1990s stimulated much theoretical interest,
leading to new results for ER networks. Measures such as the diameter and the
average path length were studied extensively. However, apart from a few studies, the
entire DSPL has attracted little attention. This distribution is of great importance
for the temporal evolution of dynamical processes on networks, such as signal propagation,
navigation and epidemic spreading. It determines the number of nodes exposed to a
propagating signal originated from a given node as a function of time. More generally, the 
shortest paths can be considered as the backbone of a more complete set of paths between 
pairs of nodes. While the shortest paths provide the fastest propagation, signals also utilize 
longer paths which are more numerous. This was demonstrated in studies of first passage 
times in diffusive processes on networks.

In this Letter we present two analytical approaches for calculating the DSPL between
nodes in the ER network, referred to as the recursive shells approach (RSA) and the recursive 
paths approach (RPA). Using recursion equations we study this distribution in different 
regimes, namely sparse and dense networks of small as well as asymptotically large sizes. 
Consider an ER network of $N$ nodes, where each pair of nodes is independently connected
with probability $p$. We denote such a network by ER($N$, $p$). For large $N$, its degree distribution
is well approximated by the Poisson distribution with the parameter $Np$, which is equal to the
average degree. Such networks are often studied in the asymptotic limit, where $N \to \infty$. In this
limit, one can identify different regimes, according to the scaling of $p$ vs. $N$.

For sparse networks, denoted by ER($N$, $c/N$), the average degree is $c = Np$. At $c = 1$
there is a percolation transition. For $c < 1$, the network consists of small isolated clusters.
For $c > 1$, a giant component whose size scales linearly with $N$ is formed, in addition to
small, isolated components whose maximal size scales as $\ln N$ [5]. For dense networks,
the parameter $p$ scales as $N^{\alpha - 1}$, where $0 < \alpha < 1$, the mean connectivity grows with the
network size as $N^{\alpha}$ and the number of isolated components vanishes.

When a pair of nodes resides on the same connected sub-network, one can identify paths 
connecting these nodes. The path length is the number of edges along the path. The distance 
$d_{ij}$ between a pair of different nodes $i$ and $j$ is the length of the shortest path connecting
them. When $i$ and $j$ reside on different sub-networks, there is no path between them and
thus $d_{ij} = \infty$. The tail distribution $F_N(k) = \Pr(d > k)$, $k = 0, 1, 2, \dots, N - 1$, is the
probability that the distance $d$ between a random pair of nodes in an ER network of size $N$
is larger than $k$. Clearly, the probability that two distinct random nodes are at a distance
$d > 0$ from each other is $F_N(0) = 1$, while the probability that $d > 1$, namely they are not
directly connected, is $F_N(1) = q$, where $q = 1 - p$. The probability distribution $P_N(k)$ can
be recovered as $P_N(k) = F_N(k - 1) - F_N(k)$, $k = 1, 2, \dots, N - 1$. The probability $F_N(k)$ does
not necessarily converge to zero in the limit $k \to \infty$. Its asymptotic value $F(\infty)$ is equal to
the fraction of pairs of nodes in the network which belong to different clusters, namely for
which $d_{ij} = \infty$. In fact, $F(\infty)$ can be estimated independently by using known properties
of the fraction of nodes, $g$, which belongs to the giant component in the asymptotic limit
[8]. This fraction satisfies $g = 1 - \exp(-cg)$ and $F(\infty) = 1 - g^2$. In a finite network $F(\infty)$
can be replaced by $F_N(N - 1)$ since the longest possible distance is $d = N - 1$.
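
The self-consistency condition for $g$ can be solved numerically by fixed-point iteration. The following Python sketch is an illustration and not part of the original analysis: it iterates $g \leftarrow 1 - \exp(-cg)$ starting from $g = 1$ and evaluates the plateau $F(\infty) = 1 - g^2$. The names `giant_fraction` and `tail_plateau`, the tolerance and the iteration cap are assumptions introduced here.

```python
import math


def giant_fraction(c, tol=1e-12, max_iter=10_000):
    """Solve g = 1 - exp(-c * g) by fixed-point iteration, starting from g = 1.

    For c <= 1 the iteration drifts towards the trivial solution g = 0.
    If the tolerance is not reached within max_iter steps, the last iterate is returned.
    """
    g = 1.0
    for _ in range(max_iter):
        g_next = 1.0 - math.exp(-c * g)
        if abs(g_next - g) < tol:
            return g_next
        g = g_next
    return g


def tail_plateau(c):
    """Asymptotic fraction of pairs in different clusters, F(inf) = 1 - g^2."""
    g = giant_fraction(c)
    return 1.0 - g * g
```

For example, `tail_plateau(2.5)` evaluates this asymptotic estimate at the mean degree used in the simulations of Fig. 2.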

In the RSA, one picks a random node, $i$, as a reference node and examines the shell
structure of the rest of the network around it. The number of nodes which are at a distance
$d > k$, $k = 0, 1, 2, \dots, N - 1$, from the reference node is denoted by $N_k$. The number of
nodes at distance $d = k$ from the reference node is denoted by $n_k$, where $n_0 = 1$ and
$n_k = N_{k-1} - N_k$ for $k \ge 1$. The $n_k$'s obey the recursion equation $n_{k+1} = N_k (1 - q^{n_k})$, which
can be re-written as a second order difference equation of the form $N_{k+1} = N_k \, q^{N_{k-1} - N_k}$,
where $N_0 = N - 1$ and $N_1 = (N - 1) q$. Using the relation $N_k = (N - 1) F_N(k)$, it can be
expressed as

$$F_N(k + 1) = F_N(k) \, q^{(N - 1)[F_N(k - 1) - F_N(k)]}, \tag{1}$$

where $F_N(0) = 1$ and $F_N(1) = q$.
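
Equation (1) can be iterated directly from its two initial conditions. The sketch below is a minimal Python illustration of that iteration, assuming only $F_N(0) = 1$ and $F_N(1) = q$ as stated above; the function name `rsa_tail` and the cutoff `k_max` are hypothetical names introduced here.

```python
def rsa_tail(N, p, k_max):
    """Tail distribution F_N(k), k = 0..k_max, from the RSA recursion, Eq. (1)."""
    q = 1.0 - p
    F = [1.0, q]  # initial conditions F_N(0) = 1 and F_N(1) = q
    for k in range(1, k_max):
        # F_N(k+1) = F_N(k) * q^{(N-1) [F_N(k-1) - F_N(k)]}
        F.append(F[k] * q ** ((N - 1) * (F[k - 1] - F[k])))
    return F[: k_max + 1]
```

The probabilities $P_N(k)$ follow from the returned list as successive differences, `F[k - 1] - F[k]`.
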
In the RPA one first picks two distinct random nodes, $i$ and $j$. The probability that the
distance between them is larger than $k$ can be related to the probability that it is larger
than $k - 1$ by $F_N(k) = F_N(k - 1) \, P_N(d > k \mid d > k - 1)$, where $P_N(d > k \mid d > k - 1)$ is the
conditional probability that the distance is larger than $k$, given that it is larger than $k - 1$.
The iteration of this relation yields

$$F_N(k) = F_N(1) \prod_{m=2}^{k} P_N(d > m \mid d > m - 1). \tag{2}$$

This means that in order to obtain the distribution $F_N(k)$, all we need to calculate are the
conditional probabilities $P_N(d > m \mid d > m - 1)$, for all values of $2 \le m \le k$.

Consider a path of length $k$ starting at node $i$ and ending at node $j$ (assuming that there
is no such path of length $k - 1$ or less). The path can be decomposed into a single edge from
node $i$ to an intermediate node $\ell$ and a shorter path of length $k - 1$ from $\ell$ to $j$. Such a
path can be ruled out in two ways: either there is no edge between $i$ and $\ell$ (with probability
$q$), or, in case that there is such an edge, there is no path of length $k - 1$ between $\ell$ and
$j$. The probability of the latter is $P_{N-1}(d > k - 1 \mid d > k - 2)$, since the remaining path is
embedded in a smaller network of $N - 1$ nodes. Combining the two possibilities yields the
recursion equation

$$P_N(d > k \mid d > k - 1) = \left[ q + p \, P_{N-1}(d > k - 1 \mid d > k - 2) \right]^{N - 2}, \tag{3}$$

where the right hand side is raised to the power $N - 2$ in order to account for all possible
ways to choose the intermediate node $\ell$. In Fig. 1 we present the possible paths of length $k$
between $i$ and $j$. This approach follows the spirit of the renormalization group theory [18],
since the removal of a node from the network reduces the size of the configuration space by
a factor of $2^{N-1}$. This process is repeated $k - 1$ times, reducing the network down to size
$N' = N - k + 1$ and closing the recursion equations with $P_{N'}(d > 1 \mid d > 0) = F_{N'}(1) = q$.
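
The recursion of Eq. (3), together with the product of Eq. (2), translates into a short loop that starts from the smallest network, of size $N' = N - k + 1$, and adds one node per step. The Python sketch below is one possible reading of these equations, given for illustration only; the names `rpa_conditional` and `rpa_tail` are hypothetical.

```python
def rpa_conditional(N, p, k):
    """P_N(d > k | d > k-1) from Eq. (3), closed with P_{N'}(d > 1 | d > 0) = q."""
    q = 1.0 - p
    x = q                               # level 1, on the smallest network N' = N - k + 1
    for j in range(2, k + 1):
        n = N - k + j                   # network size at recursion level j
        x = (q + p * x) ** (n - 2)      # Eq. (3) applied to a network of n nodes
    return x


def rpa_tail(N, p, k_max):
    """Tail distribution F_N(k), k = 0..k_max, from the product in Eq. (2)."""
    F = [1.0]                           # F_N(0) = 1
    for k in range(1, k_max + 1):
        F.append(F[-1] * rpa_conditional(N, p, k))
    return F
```

For $k = 2$ the loop runs once and returns $(q + pq)^{N-2} = (1 - p^2)^{N-2}$.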



Fig. 1: (Color online) Illustration of the possible paths of length $k$ between two random nodes $i$
and $j$ in an ER network of $N$ nodes. The first edge of such a path connects node $i$ to some other
node $\ell$, which may be any one of the remaining $N - 2$ nodes. The rest of the path, from $\ell$ to $j$, is
of length $k - 1$ and it resides on a smaller network of $N - 1$ nodes. The path of length $k$ from $i$ to
$j$ exists only when both the edge from $i$ to $\ell$ and the path of length $k - 1$ from $\ell$ to $j$ exist.


Interestingly, inserting $k = 2$ in Eq. (3) gives rise to the simple and exact expression

$$P_N(d > 2 \mid d > 1) = (1 - p^2)^{N - 2}. \tag{4}$$

Each path of length k = 2 between nodes i and j consists of a single intermediate node 
and two edges. These paths do not overlap and are thus independent. Paths of lengths 
$k > 2$ may share edges with other paths of the same length as well as with shorter paths.
Therefore, in the calculation of the DSPL we use conditional probabilities to ensure that 
no shorter paths exist. This approach eliminates the correlations between paths of different 
lengths. On the other hand, nodes i and j may be connected by several paths of the 
same length, which may share some edges and thus become correlated. The RPA does not 
account for such correlations, because it assumes that the sub-networks of size $N - 1$ are
independent. Averaging over the quenched randomness in each instance of such network, 
the RPA provides the distribution over an ensemble of networks. 

In the limit $p \to 0$ one can simplify the recursion equations and obtain the approximate
closed form expression

$$P_N(d > k \mid d > k - 1) = \left( 1 - p^k \right)^{(N - 2)(N - 3) \cdots (N - k)} \tag{5}$$

for any value of $k$. This expression is obtained using induction, based on Eq. (3) and
the exact result given above for $k = 2$. This can be understood intuitively since the total
number of possible paths of length $k$ between nodes $i$ and $j$ is given by the product
$(N - 2) \cdots (N - k)$, and the probability for each of these paths to be connected is given by
$p^k$. This approximation breaks down for values of $p$ which are not exceedingly small, where
the correlations between different paths build up and cannot be ignored.
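
The quality of the small-$p$ approximation can be checked by evaluating the closed form of Eq. (5) next to the recursion of Eq. (3). The Python sketch below is an illustration under the assumption that the exponent in Eq. (5) is the path count $(N - 2) \cdots (N - k)$ quoted in the text; the function names are hypothetical, and the logarithmic form is used only to keep the large exponent numerically stable.

```python
import math


def rpa_conditional(N, p, k):
    """P_N(d > k | d > k-1) from the recursion of Eq. (3)."""
    q = 1.0 - p
    x = q                               # closure: P_{N'}(d > 1 | d > 0) = q
    for j in range(2, k + 1):
        n = N - k + j                   # network size at recursion level j
        x = (q + p * x) ** (n - 2)
    return x


def small_p_conditional(N, p, k):
    """Closed form of Eq. (5): (1 - p^k) raised to the number of candidate paths."""
    n_paths = math.prod(N - i for i in range(2, k + 1))   # (N-2)(N-3)...(N-k)
    return math.exp(n_paths * math.log1p(-(p ** k)))


if __name__ == "__main__":
    for p in (0.001, 0.01):
        exact = rpa_conditional(1000, p, 3)
        approx = small_p_conditional(1000, p, 3)
        print(f"p = {p}: recursion {exact:.6f}, closed form {approx:.6f}")
```

For $k = 2$ the two functions coincide, in line with Eq. (4); for $k = 3$ and $N = 1000$ they stay close at $p = 0.001$ and drift apart at $p = 0.01$.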

The regime of sparse networks was studied extensively, focusing on the diameter (namely,
the largest distance between any pair of nodes) of the giant cluster, which scales like a
constant times $\ln N$, where the constant is $1/\ln c - 2/\ln c'$, where $c' < 1$ satisfies the equation
$c' \exp(-c') = c \exp(-c)$. In the strongly connected regime, we focus on the case in which
$p = b N^{\alpha - 1}$, where $b > 0$ and $0 < \alpha < 1$. In this case the average degree increases with the
network size as $N^{\alpha}$. We will now derive an asymptotic result for the limit $N \to \infty$. In this
limit $p \to 0$ and therefore the simplified results of Eq. (5) can be used. Plugging the scaling
of $p$ vs. $N$ into Eq. (5) one obtains


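$$P_N(d > k \mid d > k - 1) \simeq \left( 1 - \frac{b^k}{N^{k(1 - \alpha)}} \right)^{N^{k - 1}}. \tag{6}$$

For $N \to \infty$, $P_N(d > k \mid d > k - 1) \to P(d > k \mid d > k - 1)$, where $P(d > k \mid d > k - 1) = 1$ for
$k < 1/\alpha$, $\exp(-b^{1/\alpha})$ for $k = 1/\alpha$ and $0$ for $k > 1/\alpha$. Note that the second case in the above
equation is obtained only in the special case of $\alpha = 1/r$, where $r$ is an integer. Therefore, we
will first consider the generic case in which $\alpha$ is not an exact inverse of an integer. Inserting
the result for the conditional probabilities into Eq. (2) we obtain

$$P(k) = \begin{cases} 1 & : \; k = \lfloor 1/\alpha \rfloor + 1 \\ 0 & : \; \text{otherwise}, \end{cases} \tag{7}$$

where $\lfloor x \rfloor$ is the integer part of $x$. In case that $\alpha = 1/r$ we obtain that

$$P(k) = \begin{cases} 1 - e^{-b^r} & : \; k = r \\ e^{-b^r} & : \; k = r + 1 \\ 0 & : \; \text{otherwise}. \end{cases} \tag{8}$$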

These results can be understood intuitively using the following argument. Starting from 
node i, we define the shell of radius d = 1 around it as the set of nodes which are directly 
connected to $i$. The expected value for the number of nodes in this shell is $n_1 \sim N^{\alpha}$.
Proceeding by induction, the shell of radius $d$ is defined as the set of nodes which are
directly connected to nodes in the shell of radius $d - 1$. Thus, the number of nodes in the
shell of radius $d$ is given by $n_d \sim N^{d\alpha}$. In the asymptotic limit, as long as $d\alpha < 1$, the shell
of radius $d$ still consists of an exceedingly small fraction of the nodes in the network. On the
other hand, once $d\alpha > 1$, this shell includes almost every node in the network. This means
that almost all the nodes in the network are at a distance $d = \lfloor 1/\alpha \rfloor + 1$ from node $i$. Since
node $i$ was chosen at random, this means that the shortest path between almost any pair of
nodes in the network is of length $d$.


The case of $\alpha = 1/r$, where $r$ is an integer, requires special consideration. Based on
the argument presented above, the neighborhood of radius d = r from node i should include 
all the N nodes. However, this counting includes duplications, namely nodes which are 
connected to node i by several paths of length r. As a result, there are other nodes which 
are not reached by any of these paths. Since the number of nodes of distance r from node i 
scales with N, it is clear that each one of the remaining nodes is connected to at least one 
of them. Therefore, the remaining nodes are at a distance d = r + 1 from node i. 
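
The two asymptotic cases, the generic one and the special case $\alpha = 1/r$, can be collected in a small helper. The Python sketch below is an illustration only; it takes $\alpha$ as an exact `Fraction` so that the test for $1/\alpha$ being an integer is not affected by floating-point rounding, and the function name `asymptotic_dspl` is hypothetical.

```python
from fractions import Fraction
import math


def asymptotic_dspl(alpha, b=1.0):
    """Asymptotic P(k) for p = b * N^(alpha - 1), returned as {distance: probability}.

    alpha must be a Fraction with 0 < alpha < 1.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    inv = 1 / Fraction(alpha)
    if inv.denominator == 1:
        # Special case alpha = 1/r: weight is split between k = r and k = r + 1.
        r = inv.numerator
        w = math.exp(-(b ** r))
        return {r: 1.0 - w, r + 1: w}
    # Generic case: all the weight sits at k = floor(1/alpha) + 1.
    return {math.floor(inv) + 1: 1.0}
```

For instance, `asymptotic_dspl(Fraction(2, 5))` places all the weight at $k = 3$, while `asymptotic_dspl(Fraction(1, 2))` splits it between $k = 2$ and $k = 3$.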


Before presenting the results obtained from the two approaches, we refer to an earlier
study of the DSPL in ER networks. We briefly summarize its approach, adapting the
notation where appropriate. The expectation value for the number of nodes at a distance
$k - 1$ or less from the reference node is given by $n(k) = [1 - F_N(k - 1)] N$. This is due
to the fact that the probability for a random node to be at a distance smaller than $k$ is
$1 - F_N(k - 1)$, and multiplying by $N$ one obtains $n(k)$. In order for a node to be at a
distance larger than $k$ from the reference node, it must not be directly connected to any of
the $n(k)$ nodes which are at distance $k - 1$ or less from the reference node. Picking a random
node, the probability that it will not be connected to any of these nodes is given by

$$F_N(k) = q^{[1 - F_N(k - 1)] N}. \tag{9}$$

This recursion equation can be iterated, starting from $F_N(0) = (N - 1)/N$, to obtain $F_N(k)$
for $k = 1, 2, \dots$. A potential problem with this approach is that in the estimation of the
probability, $F_N(k)$, that a random node will be at distance larger than $k$ from the reference
node, Eq. (9) ignores the possibility that the random node is already connected to the
reference node by a path of length $k - 1$ or less. This is expected to bias the distribution
towards larger distances.
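
For comparison, the recursion of Eq. (9) is a first order iteration and is equally simple to evaluate. The Python sketch below is a minimal illustration using the starting value $F_N(0) = (N - 1)/N$ quoted above; the function name `earlier_tail` is hypothetical.

```python
def earlier_tail(N, p, k_max):
    """Tail distribution from the first order recursion of Eq. (9)."""
    q = 1.0 - p
    F = [(N - 1) / N]                   # starting value F_N(0) = (N - 1) / N
    for k in range(1, k_max + 1):
        n_k = (1.0 - F[k - 1]) * N      # expected number of nodes within distance k - 1
        F.append(q ** n_k)              # not connected to any of these n(k) nodes
    return F
```

With this starting value the first step gives $n(1) = 1$ and therefore $F_N(1) = q$.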


In Fig. 2 we present the tail distribution $F_N(k)$ vs. $k$, for an ER network of $N = 1000$
nodes and $p = c/N$, where $c = 2.5$, obtained from numerical simulations for all pairs of nodes
(x) and for pairs of nodes on the same cluster (+). We also present the theoretical results
obtained from the RSA (□) and from the RPA (o). The results of the RSA agree well with
the numerical results for all pairs, except for the limit of large distances where the plateau
in $F_N(k)$ is lower than the empirical curve. It means that this approach underestimates the
fraction of pairs for which $d_{ij} = \infty$, which is equal to $F(\infty)$. The results of the RPA agree
well with the numerical results for pairs which reside on the same cluster. This is due to the 
fact that this approach reconstructs the remaining network at each iteration of the recursion 
equations. As a result, the quenched randomness of the connectivities in each realization of 
the network is annealed, eliminating the isolated nodes and the small, isolated clusters. In 
the RSA there is no such annealing. Therefore, the RSA applies to all pairs of nodes in the 
network while the RPA applies to pairs of nodes on the same cluster. In the limit of dense 
networks there are no isolated components and the two approaches coincide. 
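
A numerical estimate of the tail distribution over all pairs can be obtained by sampling a network and running a breadth-first search from every node. The Python sketch below shows one straightforward way to do this with the standard library only; it is not the simulation code behind Fig. 2, and the function names, the seeded generator and the treatment of unreachable pairs as $d = \infty$ are choices made here for illustration.

```python
import random
from collections import deque


def sample_er(N, p, rng):
    """Adjacency lists of one ER(N, p) realization."""
    adj = [[] for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def bfs_distances(adj, source):
    """Shortest path lengths from source; unreachable nodes keep the value -1."""
    dist = [-1] * len(adj)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def empirical_tail(adj, k_max):
    """Fraction of ordered pairs with d > k, k = 0..k_max (unreachable counts as d = inf)."""
    N = len(adj)
    hist = [0] * (k_max + 1)            # hist[d] = ordered pairs at distance d
    for i in range(N):
        for d in bfs_distances(adj, i):
            if 1 <= d <= k_max:
                hist[d] += 1
    total = N * (N - 1)
    tail, remaining = [], total
    for k in range(k_max + 1):
        remaining -= hist[k]
        tail.append(remaining / total)
    return tail


if __name__ == "__main__":
    rng = random.Random(0)
    adj = sample_er(300, 2.5 / 300, rng)
    print(empirical_tail(adj, 20))
```

Restricting the tally to pairs with a finite distance would give the corresponding curve for pairs on the same cluster.
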

Fig. 2: (Color online) (a) The tail distribution $F_N(k)$ vs. $k$ for the ER($N$, $c/N$) network with
$N = 1000$ and $c = 2.5$, obtained from numerical simulations for all pairs of nodes (x) and for pairs
of nodes on the same cluster (+). The results of the RSA (□) agree well with the numerical results
for all pairs of nodes, except for the asymptotic tail. The results of the RPA (o) agree well with
the numerical results for pairs of nodes on the same cluster.


The distribution $P_N(k)$ can be characterized by its moments. The $n$th moment, $\langle k^n \rangle$, can
be obtained using the tail-sum formula $\langle k^n \rangle = \sum_{k=0}^{N-2} [(k + 1)^n - k^n] F_N(k)$. In particular, the
first moment is given by $\langle k \rangle = \sum_{k=0}^{N-2} F_N(k)$ and the second moment by
$\langle k^2 \rangle = \sum_{k=0}^{N-2} (2k + 1) F_N(k)$. The width of the distribution can be characterized by the
variance $\sigma^2 = \langle k^2 \rangle - \langle k \rangle^2$.
Related topological indices [20] such as the Wiener index [21] and the Harary index [22]
were studied in the context of chemical graphs. It was shown that important properties of
molecules can be obtained using such indices for the graphs representing their structure [21].
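
The tail-sum expressions for the first two moments can be applied to any tail distribution given as a list. The Python sketch below is a minimal illustration, assuming the list holds $F_N(0), F_N(1), \dots$ and that the tail has decayed to zero by its last entry (pairs on the same cluster); the function name `moments_from_tail` is hypothetical.

```python
def moments_from_tail(F):
    """Mean and variance of the distance from the tail distribution F[k] = Pr(d > k)."""
    mean = sum(F)                                          # <k>   = sum_k F(k)
    second = sum((2 * k + 1) * f for k, f in enumerate(F))  # <k^2> = sum_k (2k + 1) F(k)
    return mean, second - mean ** 2
```

As a check, a distribution concentrated at $d = 3$ has the tail `[1, 1, 1, 0]`, for which the function returns a mean of 3 and zero variance.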




Fig. 3: (Color online) The average $\langle d \rangle$ (a) and the standard deviation $\sigma$ (b) of the DSPL in the
ER($N$, $bN^{\alpha - 1}$) network vs. $1/\alpha$ for $b = 1$ and $N = 10^3$ (solid line), $10^6$ (dashed line) and $10^9$ (dotted
line), obtained from the RPA. It is observed that $\langle d \rangle \simeq \lfloor 1/\alpha \rfloor + 1$, decorated by a rounded step
function, while $\sigma$ exhibits oscillations with maxima at integer values of $1/\alpha$.


In Fig. 3(a) we present the average distance $\langle d \rangle$ between pairs of nodes vs. $1/\alpha$ in
dense ER networks. Following Eqs. (7)-(8), these functions converge to a staircase form as
$N \to \infty$. In Fig. 3(b) we present the standard deviation $\sigma$ vs. $1/\alpha$. For finite networks it
exhibits oscillations of unit period. In the asymptotic limit the peaks become vanishingly 
narrow around the integers. 

So far we have studied the DSPL between all pairs of nodes in the network. Below, we
consider a reference node $i$ of a known degree, $m$, and study the DSPL between this node
and the rest of the network. We denote the DSPL between a random node $i$ of degree $m$ and
other random nodes, $j$, by $F_{N|m}(k) = F_N(k \mid \deg(i) = m)$ and the corresponding conditional
probability by $P_{N|m}(d > k \mid d > k - 1)$. In this case, the first iteration of the recursion
equation takes the form 

$$P_{N|m}(d > k \mid d > k - 1) = \left[ P_{N-1}(d > k - 1 \mid d > k - 2) \right]^{m}, \tag{10}$$

where the expression on the right hand side is obtained from Eq. (3). In Fig. 4 we present
the tail distribution $F_{N|m}(k)$ vs. $k$, obtained from numerical simulations for $m = 1$ (+), 3
(x) and 7 (*), in a dilute ER network of N = 1000 and c = 2.5. Each data point is averaged 
over 20 independent realizations of the network. The results of the RPA for m = 1 (o), 3 
(□) and 7 (o) are in good agreement with the numerical results. Clearly, the distribution is 
strongly affected by the local connectivity of the reference node. The knee of the distribution 
$F_{N|m}(k)$ (which coincides with the peak of the corresponding probability density function)
moves to the left as m is increased. This means that nodes which are strongly connected at 
the local level are closer to the rest of the network than weakly connected nodes. 
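
The degree-conditioned distribution can be evaluated by replacing the first recursion step with Eq. (10) and using Eq. (3) for the remaining steps. The Python sketch below is an illustration under one added assumption, labeled in the code: the initial value $F_{N|m}(1) = 1 - m/(N - 1)$, which is not stated in the text and follows from taking the $m$ neighbors to be a uniformly random subset of the other $N - 1$ nodes. The function names are hypothetical.

```python
def rpa_conditional(N, p, k):
    """P_N(d > k | d > k-1) from Eq. (3), closed with P_{N'}(d > 1 | d > 0) = q."""
    q = 1.0 - p
    x = q                               # level 1, on the smallest network N' = N - k + 1
    for j in range(2, k + 1):
        n = N - k + j                   # network size at recursion level j
        x = (q + p * x) ** (n - 2)      # Eq. (3)
    return x


def rpa_tail_given_degree(N, p, m, k_max):
    """Tail distribution F_{N|m}(k), k = 0..k_max, for a reference node of degree m."""
    # Assumption made here: the m neighbors are a uniformly random subset of the
    # other N - 1 nodes, so Pr(d > 1 | deg = m) = 1 - m / (N - 1).
    F = [1.0, 1.0 - m / (N - 1)]
    for k in range(2, k_max + 1):
        # Eq. (10): each of the m neighbors must fail to reach j within k - 1 steps.
        F.append(F[-1] * rpa_conditional(N - 1, p, k - 1) ** m)
    return F[: k_max + 1]


if __name__ == "__main__":
    for m in (1, 3, 7):
        F = rpa_tail_given_degree(1000, 2.5 / 1000, m, 60)
        print(m, sum(F))                # mean distance from the tail-sum formula
```

Summing each returned tail gives the mean distance for that degree, which in this sketch decreases as $m$ grows.
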

Fig. 4: (Color online) The DSPL $F_{N|m}(k)$ vs. $k$ between a random node $i$ of a given degree, $m$,
and all other nodes which reside on the same cluster in a dilute ER network of $N = 1000$ and
$c = 2.5$. The results of the RPA for $m = 1$ (o), 3 (□) and 7 (o) are in good agreement with the
corresponding numerical results: $m = 1$ (+), 3 (x) and 7 (*).

In summary, we have studied the distribution of shortest path lengths in ER networks using
two complementary theoretical approaches and showed that they are in good agreement
with numerical results. For large and dense networks the distribution becomes extremely
narrow and is exactly captured by both approaches. A slight modification enables us to calculate
the DSPL around a node with a given degree, $m$. The results exemplify the impact
of local features (such as the degree of a node) on global properties (such as the distance
distribution) in complex networks. The proposed theoretical approaches are highly flexible
and can be applied to more general networks.